Stabilize benchmark signal with variance-aware sampling and rolling baselines by Copilot · Pull Request #4622 · Azure/typespec-azure

Copilot · 2026-06-12T00:35:24Z

Benchmark results were overly sensitive to GitHub runner noise, causing high run-to-run deviation and unreliable PR deltas. This change makes benchmark comparisons more stable by increasing sample quality, gating on variance, and comparing against a rolling mainline baseline instead of a single latest run.

Sampling and runner stability
- Benchmark workflow now uses higher-fidelity defaults (warmup=3, iterations=25).
- Added configurable benchmark runner selection (workflow input + repo variable) to support dedicated/stable runners.
- Pinned Node version in benchmark workflow setup to reduce environment drift.

Variance-aware execution (noise gate)
- Added per-spec runtime variability statistics (mean/median/stddev/CV/min/max/sample count).
- Introduced optional noise-gating reruns when the total runtime CV (coefficient of variation, stddev/mean) exceeds a threshold:
  - --noise-cv-threshold
  - --max-reruns
  - --rerun-iterations
- Runner now records whether reruns were triggered and how many were performed.
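
The noise gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `VariabilityStats` and `NoiseGateOptions` shapes, the function names, the use of sample standard deviation, and the choice to append rerun samples to the original ones are all assumptions.

```typescript
// Hypothetical shape; field names are assumptions, not the PR's actual types.
interface VariabilityStats {
  mean: number;
  median: number;
  stddev: number;
  cv: number; // coefficient of variation = stddev / mean
  min: number;
  max: number;
  sampleCount: number;
}

function computeStats(samples: number[]): VariabilityStats {
  if (samples.length === 0) throw new Error("at least one sample is required");
  const sorted = [...samples].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid]! : (sorted[mid - 1]! + sorted[mid]!) / 2;
  // Sample variance (n - 1 denominator); defined as 0 for a single sample.
  const variance = n > 1 ? sorted.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1) : 0;
  const stddev = Math.sqrt(variance);
  const cv = mean === 0 ? 0 : stddev / mean;
  return { mean, median, stddev, cv, min: sorted[0]!, max: sorted[n - 1]!, sampleCount: n };
}

// Hypothetical options mirroring --iterations, --noise-cv-threshold, --max-reruns, --rerun-iterations.
interface NoiseGateOptions {
  iterations: number;
  noiseCvThreshold: number;
  maxReruns: number;
  rerunIterations: number;
}

function runWithNoiseGate(
  measure: (iterations: number) => number[],
  options: NoiseGateOptions,
): { stats: VariabilityStats; rerunTriggered: boolean; rerunCount: number } {
  let samples = measure(options.iterations);
  let rerunCount = 0;
  // Collect extra samples while the run is too noisy and the rerun budget allows it.
  while (rerunCount < options.maxReruns && computeStats(samples).cv > options.noiseCvThreshold) {
    samples = samples.concat(measure(options.rerunIterations)); // assumption: reruns add samples
    rerunCount++;
  }
  return { stats: computeStats(samples), rerunTriggered: rerunCount > 0, rerunCount };
}
```

With a threshold of 0.08, a run whose standard deviation is more than 8% of its mean would trigger a rerun under this sketch.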

Rolling baseline for PR comparisons
- PR comment baseline now prefers a rolling aggregate over recent main history (results/history.json) with fallback to results/latest.json.
- Added --baseline-window to control rolling window size.
- Baseline labeling now distinguishes synthetic rolling baselines from commit SHAs.
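
As an illustration of the baseline selection above, here is a minimal sketch. The `HistoryEntry` shape, the `pickBaseline` name, the label text, and the use of a median as the rolling aggregate are assumptions; the PR does not state which aggregate or schema it uses.

```typescript
// Hypothetical record shape for entries in results/history.json and results/latest.json.
interface HistoryEntry {
  commit: string;
  totalMs: number;
}

interface Baseline {
  label: string; // either a synthetic rolling label or a short commit SHA
  totalMs: number;
}

function pickBaseline(
  history: HistoryEntry[] | undefined,
  latest: HistoryEntry | undefined,
  windowSize: number, // corresponds to --baseline-window
): Baseline | undefined {
  // slice(-0) would return the whole array, so guard non-positive windows explicitly.
  const recent = history && windowSize > 0 ? history.slice(-windowSize) : [];
  if (recent.length > 0) {
    const sorted = recent.map((entry) => entry.totalMs).sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    // Assumption: the median is the aggregate, since it resists one-off outlier runs.
    const median = sorted.length % 2 === 1 ? sorted[mid]! : (sorted[mid - 1]! + sorted[mid]!) / 2;
    return { label: `rolling median of last ${recent.length} main runs`, totalMs: median };
  }
  // Fallback: compare against the single latest run, labeled by short commit SHA.
  if (latest) return { label: latest.commit.slice(0, 7), totalMs: latest.totalMs };
  return undefined;
}
```

In this sketch, one slow run inside the window (for example 500 ms among runs near 100 ms) does not shift the baseline the way it would if the PR were compared against that single run.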

Benchmark output and docs updates
- Added shared statistics utilities for variability calculations.
- Updated benchmark summaries/comments to surface variability context.
- Updated benchmark README and tests to cover new CLI/options and formatting behavior.

node packages/benchmark/dist/src/cli.js run \
  --iterations 25 \
  --warmup 3 \
  --noise-cv-threshold 0.08 \
  --max-reruns 1 \
  --rerun-iterations 10 \
  --output /tmp/benchmark-results.json

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

azure-sdk · 2026-06-12T00:43:37Z

No changes needing a change description found.

azure-sdk · 2026-06-12T00:52:35Z

You can try these changes here

🛝 Playground	🌐 Website

timotheeguerin · 2026-06-15T14:47:26Z

@copilot fix formatting and spellcheck, and run the benchmark CI

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot · 2026-06-15T15:23:26Z

Addressed in commit c226b35.

Fixed formatting issues (prettier --check now passes).
Resolved spellcheck by renaming unflattenRuntime to expandRuntimeMetrics.
Ran benchmark CI-equivalent checks:
- pnpm -r --filter "@azure-tools/typespec-benchmark..." build
- pnpm --filter @azure-tools/typespec-benchmark test
- benchmark run command with CI flags completed successfully.

timotheeguerin · 2026-06-15T15:58:14Z

@copilot the run benchmark workflow is still failing

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot · 2026-06-15T16:05:35Z

Fixed in commit 093bcfe.

Root cause was Node engine mismatch in the benchmark workflow (pnpm install failed with ERR_PNPM_UNSUPPORTED_ENGINE on Node 24.11.1). I updated .github/workflows/benchmark.yml to use Node 24.15.0, which satisfies the dependency engine requirement.

timotheeguerin · 2026-06-24T15:22:19Z

@copilot the benchmark ran into an OOM


  Benchmarking: azure-core-dataplane
    Warmup 1/3...
    Warmup 2/3...
    Warmup 3/3...
    Iteration 1/25...
    Iteration 2/25...
    Iteration 3/25...
    Iteration 4/25...
    Iteration 5/25...
    Iteration 6/25...
    Iteration 7/25...
    Iteration 8/25...

<--- Last few GCs --->

[4813:0x9db3000]   429954 ms: Scavenge (interleaved) 4048.6 (4076.5) -> 4041.3 (4114.7) MB, pooled: 0 MB, 34.43 / 0.00 ms  (average mu = 0.217, current mu = 0.229) allocation failure; 
[4813:0x9db3000]   431946 ms: Mark-Compact (reduce) 4062.8 (4116.2) -> 4047.8 (4061.7) MB, pooled: 0 MB, 145.02 / 0.00 ms  (+ 1717.1 ms in 328 steps since start of marking, biggest step 8.1 ms, walltime since start of marking 1992 ms) (average mu = 0.211,
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----

 1: 0x73f8c4 node::OOMErrorHandler(char const*, v8::OOMDetails const&) [node]
 2: 0xc06f90  [node]
 3: 0xc0707f  [node]
 4: 0xeaa885  [node]
 5: 0xeaa8b2  [node]
 6: 0xeaabaa  [node]
 7: 0xebb8aa  [node]
 8: 0xebfc50  [node]
 9: 0x1953f71  [node]
/home/runner/work/_temp/a4a90c35-7d83-4657-bd54-f22379613fdf.sh: line 9:  4813 Aborted                 (core dumped) node packages/benchmark/dist/src/cli.js run --specs-dir packages/benchmark/specs --iterations 25 --warmup 3 --noise-cv-threshold 0.08 --max-reruns 1 --rerun-iterations 10 --commit 6f075c146cd770a97dca3d420e095e7220bdeb57 --output /tmp/benchmark-results.json
Error: Process completed with exit code 134.

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot · 2026-06-24T15:31:54Z

Fixed in 1248489.

I updated the benchmark workflow to run the benchmark CLI with a larger Node heap (--max-old-space-size=6144) for both run and backfill, which addresses the OOM seen during high-iteration runs.
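
One way to confirm that a heap flag such as `--max-old-space-size` actually reached the benchmark process is to print V8's reported limit at startup. This is a small sketch, not part of the PR; `heapLimitMiB` is a hypothetical helper name, and the reported limit is typically somewhat above the old-space value passed on the command line.

```typescript
import { getHeapStatistics } from "node:v8";

// Returns V8's total heap size limit for the current process, in MiB.
function heapLimitMiB(): number {
  return Math.round(getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`V8 heap limit: ${heapLimitMiB()} MiB`);
```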

timotheeguerin · 2026-06-24T16:44:14Z

@copilot still OOM. Increasing memory doesn't seem to be the right solution anyway. What is causing this increase in memory use? Is there a memory leak in an emitter or in the benchmark code?

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot · 2026-06-24T17:18:06Z

The root cause is memory accumulation in the benchmark process, not a single emitter regression: repeated in-process compile() calls retain heap across iterations and specs until the run OOMs. I fixed this by isolating each iteration's compile in a short-lived child process, so memory is reclaimed between iterations while the benchmark output and metrics stay the same (3f2a5f9). I also validated the benchmark build and tests, and reran the high-iteration benchmark command successfully.
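
The per-iteration isolation described above can be sketched as follows. This is only an illustration under stated assumptions: `CHILD_SCRIPT` is a stand-in workload rather than a real compile() call, and `runIsolatedIteration`, `runIterations`, and the JSON-over-stdout result format are hypothetical, not the PR's implementation.

```typescript
import { spawnSync } from "node:child_process";

// Stand-in workload run inside the child; a real runner would invoke the compiler here.
const CHILD_SCRIPT = `
const start = process.hrtime.bigint();
const junk = Array.from({ length: 100000 }, (_, i) => ({ i }));
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
process.stdout.write(JSON.stringify({ elapsedMs, allocated: junk.length }));
`;

interface IterationResult {
  elapsedMs: number;
  allocated: number;
}

// Runs one iteration in a fresh Node process. Everything the child allocated
// is returned to the OS when it exits, so nothing accumulates in the parent.
function runIsolatedIteration(): IterationResult {
  const child = spawnSync(process.execPath, ["-e", CHILD_SCRIPT], { encoding: "utf8" });
  if (child.error) throw child.error;
  if (child.status !== 0) {
    throw new Error(`child exited with status ${child.status}: ${child.stderr}`);
  }
  return JSON.parse(child.stdout) as IterationResult;
}

function runIterations(count: number): IterationResult[] {
  return Array.from({ length: count }, () => runIsolatedIteration());
}
```

The trade-off in this design is process startup cost on every iteration. Timing inside the child, as the sketch does, keeps that startup cost out of the measured value.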

Copilot AI and others added 2 commits June 11, 2026 21:43

Implement benchmark variance and rolling baseline

468b5e6

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Address benchmark review follow-ups

e6f1059

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot AI assigned Copilot and timotheeguerin Jun 12, 2026

Copilot created this pull request from a session on behalf of timotheeguerin June 12, 2026 00:35

microsoft-github-policy-service Bot added the eng label Jun 12, 2026

timotheeguerin marked this pull request as ready for review June 12, 2026 00:41

timotheeguerin requested review from bterlson, markcowl and timotheeguerin as code owners June 12, 2026 00:42

Copilot started work on behalf of timotheeguerin June 15, 2026 14:48

Fix benchmark formatting and spellcheck follow-up

c226b35

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot finished work on behalf of timotheeguerin June 15, 2026 15:24

Copilot started work on behalf of timotheeguerin June 15, 2026 15:58

Fix benchmark workflow Node engine mismatch

093bcfe

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot finished work on behalf of timotheeguerin June 15, 2026 16:06

markcowl approved these changes Jun 23, 2026

Comment thread packages/benchmark/src/statistics.ts

Copilot started work on behalf of timotheeguerin June 24, 2026 15:23

Increase benchmark workflow Node heap limit to prevent OOM

1248489

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot finished work on behalf of timotheeguerin June 24, 2026 15:32

Copilot started work on behalf of timotheeguerin June 24, 2026 16:45

Copilot AI and others added 2 commits June 24, 2026 17:12

Isolate benchmark iterations to prevent benchmark-run OOM

e2284a1

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Address benchmark review feedback on process output handling

3f2a5f9

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot finished work on behalf of timotheeguerin June 24, 2026 17:18

timotheeguerin enabled auto-merge June 30, 2026 14:55

timotheeguerin added this pull request to the merge queue Jun 30, 2026

Merged via the queue into main with commit 8d517e0 Jun 30, 2026
22 checks passed

timotheeguerin deleted the copilot/improve-benchmark-accuracy branch June 30, 2026 15:21

timotheeguerin merged 7 commits into main from copilot/improve-benchmark-accuracy
